Explain whether each scenario is a classification or regression problem, and indicate whether we are most interested in inference or prediction. Finally, provide n and p.
What are the advantages and disadvantages of a very flexible (versus a less flexible) approach for regression or classification? Under what circumstances might a more flexible approach be preferred to a less flexible approach? When might a less flexible approach be preferred?
For prediction on new data, we want a model that generalizes. A very flexible model can capture fine-grained trends in the data set at hand, but it may overfit and predict new or future observations poorly (i.e., the model describes patients 1-100 perfectly, but fails on patient 101 because it was tuned to patients 1-100). A less flexible model often predicts better on new data because it captures only the strongest trends, though it may miss real structure if the true relationship is complex. A more flexible approach is preferred when the sample size is large or the true relationship is highly non-linear; a less flexible approach is preferred when data are limited or noisy, or when interpretability matters.
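A minimal sketch of this trade-off, using simulated data (the linear "truth" and the degree-10 polynomial are choices made for illustration): the flexible fit always matches the training data at least as closely, but can do worse on fresh data drawn from the same process.

```r
# Simulated example: the true relationship is linear, so the flexible
# degree-10 polynomial mostly fits noise beyond the linear trend.
set.seed(1)
x_train <- runif(100, 0, 10)
y_train <- 2 + 0.5 * x_train + rnorm(100)
x_test  <- runif(100, 0, 10)
y_test  <- 2 + 0.5 * x_test + rnorm(100)

rigid    <- lm(y_train ~ x_train)            # less flexible
flexible <- lm(y_train ~ poly(x_train, 10))  # more flexible

# Helper (written for this sketch) to compute test-set MSE
test_mse <- function(fit, x, y) {
  mean((y - predict(fit, newdata = data.frame(x_train = x)))^2)
}

mean(residuals(rigid)^2)     # training MSE, rigid fit
mean(residuals(flexible)^2)  # training MSE, flexible fit (never larger)
test_mse(rigid,    x_test, y_test)
test_mse(flexible, x_test, y_test)
```

Because the linear model is nested inside the polynomial, the flexible fit's training MSE is guaranteed to be no larger; any gap on the test set is the price of flexibility.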
Describe the differences between a parametric and a non-parametric statistical learning approach. What are the advantages of a parametric approach to regression or classification (as opposed to a nonparametric approach)? What are its disadvantages?
Parametric approaches reduce the problem of estimating \(f\) to estimating a fixed set of parameters, which makes them simpler to fit, easier to interpret, and usable with relatively few observations. Their disadvantage is that the assumed functional form is only an approximation: if it is far from the true \(f\), the model will be systematically biased no matter how much data we have. Non-parametric approaches make no explicit assumption about the form of \(f\) and can therefore fit a much wider range of shapes, but they require many more observations to estimate \(f\) accurately and are more prone to overfitting.
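A small sketch of the contrast, on simulated data chosen for illustration: `lm()` is parametric (it estimates just an intercept and a slope), while `loess()` is non-parametric (local fitting, no fixed global form). When the linearity assumption is wrong, the parametric fit is biased.

```r
# Truth is non-linear (a sine wave), so the linear assumption fails.
set.seed(42)
x <- seq(0, 2 * pi, length.out = 50)
y <- sin(x) + rnorm(50, sd = 0.2)

param_fit    <- lm(y ~ x)     # parametric: assumes y = b0 + b1*x
nonparam_fit <- loess(y ~ x)  # non-parametric: no assumed global form

# In-sample mean squared residuals: the biased parametric fit
# cannot track the sine shape, the local fit can.
mean(residuals(param_fit)^2)
mean(residuals(nonparam_fit)^2)
```

The flip side, not visible in this small example, is that `loess()` needs considerably more data than `lm()` to achieve the same accuracy when the truth really is linear.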
Suppose we wish to use this data set to make a prediction for Y when X1 = X2 = X3 = 0 using K-nearest neighbors.
The Euclidean distance between points \(p\) and \(q\) with an arbitrary number of dimensions is calculated as
\[ d(\mathbf{p}, \mathbf{q}) = \sqrt{(p_1 - q_1)^2 + (p_2 - q_2)^2 + \cdots + (p_i - q_i)^2 + \cdots + (p_n - q_n)^2} = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}. \]
In three dimensions (or 3-space), \(n = 3\) and the Euclidean distance is calculated from the respective \(x\), \(y\), and \(z\) components of each point. In R, these distances can be computed using the function below.
### Define a function that takes two
### n-dimensional points as arguments
### and calculates the Euclidean distance
### between these points
.dist <- function(point1, point2){
  # point1 and point2 should be numeric vectors of n dimensions
  # check that both are numeric vectors
  if(!all(is.vector(c(point1, point2), mode = "numeric"))){
    stop("Both point1 and point2 must be numeric vectors")
  }
  # check that both points have the same number of dimensions
  if(length(point1) != length(point2)){
    stop("point1 and point2 must have the same length")
  }
  # Calculate the distance between point1 and point2
  euclidean_distance <- sqrt(sum((point1 - point2)^2))
  return(euclidean_distance)
}
We can then use the function .dist() to compute the distances between the points as shown below.
obs0 <- c( 0, 0, 0)
obs1 <- c( 0, 3, 0)
obs2 <- c( 2, 0, 0)
obs3 <- c( 0, 1, 3)
obs4 <- c( 0, 1, 2)
obs5 <- c(-1, 0, 1)
obs6 <- c( 1, 1, 1)
obs7 <- c( 1, 2, 3, 4) # 4-dimensional; excluded from df below and would fail .dist()'s length check
df <- rbind(obs0,obs1,obs2,obs3,obs4,obs5,obs6)
dist <- c(.dist(obs0,obs0), # Compute distance from obs0 to obs0
.dist(obs1,obs0), # Compute distance from obs0 to obs1
.dist(obs2,obs0), # Compute distance from obs0 to obs2
.dist(obs3,obs0), # Compute distance from obs0 to obs3
.dist(obs4,obs0), # Compute distance from obs0 to obs4
.dist(obs5,obs0), # Compute distance from obs0 to obs5
.dist(obs6,obs0)) # Compute distance from obs0 to obs6
Then, we merge the computed distances with the original data frame df and output it as a table using knitr::kable()
df <- data.frame(df,
Y = c("Black","Red","Red","Red","Green","Green","Red"),
Distance = dist)
colnames(df) <- c("$X_1$","$X_2$","$X_3$","$Y$","$D_{(0,0,0)}$")
knitr::kable(df)
| | \(X_1\) | \(X_2\) | \(X_3\) | \(Y\) | \(D_{(0,0,0)}\) |
|---|---|---|---|---|---|
| obs0 | 0 | 0 | 0 | Black | 0.000000 |
| obs1 | 0 | 3 | 0 | Red | 3.000000 |
| obs2 | 2 | 0 | 0 | Red | 2.000000 |
| obs3 | 0 | 1 | 3 | Red | 3.162278 |
| obs4 | 0 | 1 | 2 | Green | 2.236068 |
| obs5 | -1 | 0 | 1 | Green | 1.414214 |
| obs6 | 1 | 1 | 1 | Red | 1.732051 |
Lastly, we can visualize these points in 3-space in a plotly graphic using the code below.
library(plotly)
fig <- plot_ly(df,
x = ~`$X_1$`,
y = ~`$X_2$`,
z = ~`$X_3$`,
color = ~`$Y$`,
colors = c("black",'green', 'red'))
fig <- fig %>% add_markers()
fig <- fig %>% layout(scene = list(xaxis = list(title = 'X1', range = c(-5,5)),
yaxis = list(title = 'X2', range = c(-5,5)),
zaxis = list(title = 'X3', range = c(-5,5))))
fig
For this exercise we are only concerned with the first nearest neighbor to point obs0. Since the nearest neighbor is obs5 (distance 1.414), we predict that these points share the same class - thus our prediction for obs0 with \(K = 1\) is Green.
For this exercise we are concerned with the three nearest neighbors to point obs0. Since the nearest neighbors are obs5 (Green, distance 1.414), obs6 (Red, distance 1.732), and obs2 (Red, distance 2.000), we predict that obs0 shares the class of the majority - thus our prediction for obs0 with \(K = 3\) is Red.
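The two predictions above can be sketched with a tiny majority-vote helper. `knn_predict()` is written for this exercise, not a library function; the distances and labels are copied from the table above.

```r
# Hypothetical helper: predict the class of a point by majority vote
# among its k nearest neighbors, given precomputed distances.
knn_predict <- function(distances, labels, k) {
  nearest <- order(distances)[seq_len(k)]  # indices of the k smallest distances
  votes   <- table(labels[nearest])        # count classes among the neighbors
  names(votes)[which.max(votes)]           # return the majority class
}

# Distances from obs0 and classes for obs1..obs6, from the table above
d      <- c(3.000000, 2.000000, 3.162278, 2.236068, 1.414214, 1.732051)
labels <- c("Red", "Red", "Red", "Green", "Green", "Red")

knn_predict(d, labels, k = 1)  # obs5 alone -> "Green"
knn_predict(d, labels, k = 3)  # obs5, obs6, obs2 -> "Red"
```

Note that ties are possible for even \(K\); `which.max()` silently picks the first class in that case, which is one reason odd values of \(K\) are usually preferred for two-class problems.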
This problem involves writing functions.
Power(), that prints out the result of raising 2 to the 3rd power. In other words, your function should compute \(2^3\) and print out the result.
Power <- function() {
2^3
}
Power()
## [1] 8
Power2(), that allows you to pass any two numbers, "\(x\)" and "\(a\)", and prints out the value of "\(x^a\)".
Power2 <- function(x = 3, a = 7){
x ^ a
}
Power2()
## [1] 2187
Power2(5,8)
## [1] 390625
Using the Power2() function that you just wrote, compute \(10^3\), \(81^7\), and \(131^3\).
Power2(10, 3)
## [1] 1000
Power2(81, 7)
## [1] 2.287679e+13
Power2(131, 3)
## [1] 2248091
Power3(), that actually returns the result "\(x^a\)" as an R object, rather than simply printing it to the screen. That is, if you store the value "\(x^a\)" in an object called "result" within your function, then you can simply return() this result.
Power3 <- function(x = 3, a = 7){
return(x^a)
}
ans = Power3()
ans
## [1] 2187
Power3() function, create a plot of \(f(x)=x^3\). The x-axis should display a range of integers from 1 to 10, and the y-axis should display \(x^3\). Label the axes appropriately, and use an appropriate title for the figure. Consider displaying either the x-axis, the y-axis, or both on the log-scale.
x <- 1:10
y <- Power3(x, a = 3)
x;y
## [1] 1 2 3 4 5 6 7 8 9 10
## [1] 1 8 27 64 125 216 343 512 729 1000
plot(x=x,
y=y,
xlab = "X",
ylab = "X^3")
plot(x=x,
y=y,
xlab = "X",
ylab = "X^3",
type = "l")
PlotPower(), that allows you to create a plot of "\(x\)" against "\(x^a\)" for a fixed "\(a\)" and for a range of values of "\(x\)".
PlotPower <- function(x, a){
  y = x^a
  plot(x = x,
       y = y,
       xlab = "X",
       ylab = paste0("X^", a))  # label reflects the chosen exponent
}
PlotPower(x = 1:10, a = 30)